Artificial Intelligence in Medicine — Latest Matching Preprints

1

Hybrid Neural-Bayesian Belief Network Framework for Uncertainty-Aware Multimodal GBM Prediction

Jayme, A.; Heuveline, V.

2026-05-13 health informatics 10.64898/2026.05.10.26352710 medRxiv

Top 0.1%

8.3%

Background and Objective: Glioblastoma outcome prediction remains difficult because clinically relevant signals are distributed across heterogeneous imaging and genomic modalities, cohorts are small, and conventional neural predictors do not quantify their own uncertainty. This study evaluates a hybrid neural-Bayesian belief network framework for uncertainty-aware multimodal glioblastoma prediction and examines how modality selection, model family, and structure-aware regularization affect predictive performance and confidence quality. Methods: The framework was evaluated on the TCGA-GBM radiogenomic cohort using four input modalities (T1Gd, FLAIR, mRNA, and CNA), five model families, five structural-weight settings, and 15 view subsets. A secondary benchmark on the UCI Human Activity Recognition dataset was included to assess whether observed limitations were specific to the glioblastoma setting. Results: CNA features consistently reduced performance in most multimodal settings, and selective fusion excluding CNA outperformed both the full four-view baseline and imaging-only alternatives. Model families showed clear differences in uncertainty behaviour: non-Bayesian families achieved the strongest predictive accuracy, whereas the Bayesian family achieved the lowest calibration error over a narrower confidence range. Bayesian belief network regularization produced consistent directional improvements without supporting reliable structure-discovery claims, as learned graph structures were not reproducible across folds. On the secondary benchmark, the same framework achieved much higher predictive performance, indicating that the glioblastoma performance ceiling primarily reflects data limitations rather than an architectural constraint. Conclusions: In small-sample radiogenomic prediction, modality choice is at least as important as model choice, and uncertainty quality differs substantially across uncertainty-aware model families.
The proposed framework provides a practical basis for comparing accuracy, calibration, modality selection, and structure-aware regularization in multimodal biomedical prediction.

2

Clinical Note Comparison and Data Retrieval Via Embedding Vectors: Model Selection, Metrics, and Convergence

Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.

2026-05-18 health informatics 10.64898/2026.05.12.26352832 medRxiv

Top 0.1%

7.1%

Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
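The three vector comparisons named in the abstract above (cosine similarity, L1 distance, and Euclidean distance) can be illustrated with a minimal sketch. The toy vectors and the `truncate` helper are assumptions made for illustration: the abstract does not say how partial vectors were selected, so keeping the leading dimensions is only one possible reading.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def truncate(vector: Sequence[float], k: int) -> list:
    """Hypothetical partial vector: keep only the first k dimensions."""
    return list(vector[:k])


# Toy 4-dimensional "embeddings" (made-up numbers, not model output).
original = [1.0, 0.0, 2.0, 0.0]
paraphrased = [0.9, 0.1, 2.0, 0.0]   # small change: meaning preserved
deleted = [0.0, 1.0, 0.5, 1.0]       # large change: content removed

paraphrase_gap = euclidean_distance(original, paraphrased)
deletion_gap = euclidean_distance(original, deleted)
partial_gap = l1_distance(truncate(original, 2), truncate(deleted, 2))
```

In this toy setup the deletion sits farther from the original than the paraphrase on all three measures, which is the direction of change the study checks for.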

3

Performance of Large Language Models as a Tool for Primary Care Consultations: Evaluation Study

Pascual, N.; Fernandez-Pichel, M.; Losada, D. E.; Garcia-Orosa, B.; Gude, F.; Costa Lathan, C.; Sueiro Justel, J.; Gomez Fontenla, A.; Lastra Perez, M.; Alonso Garcia, F.

2026-05-04 health informatics 10.64898/2026.04.29.26352082 medRxiv

Top 0.1%

7.1%

Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and an increasing number of users now turn to these generative information systems for inquiries as sensitive and consequential as those related to health. The primary objective is to identify the main strengths and weaknesses of generative AI systems when responding to information needs as critical as those arising in the health domain. The study was structured using a question-answer format, in which each question corresponded to a user query and each answer represented the output generated by a model in response. The study employed a human evaluation framework involving two distinct panels of clinical experts from different specialties. The evaluation criteria encompassed three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. GPT-4o mini, Llama 3, and MedLlama 3 were selected as three representative systems for the experiments. This study presents a detailed analysis of the performance of widely used contemporary large language models in addressing common health-related queries posed by online users. The results reinforce the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. Among them, performance issues have been identified in areas where users may be more vulnerable, leading to the retrieval of clinically incorrect information, particularly in matters relating to rare diseases. Furthermore, it has been noted that these models can become trapped in obsolete medical knowledge due to continuous scientific progress.

4

Generation and Evaluation of Realistic Synthetic Clinical Progress Notes for Prostate Cancer using Large Language Models

Rey-Blanes, A.; Veredas-Morente, J.; Vivas-Vargas, E.; Gil-Garcia, F.; Moreno-Barea, F. J.; Veredas, F. J.

2026-05-28 health informatics 10.64898/2026.05.25.26354027 medRxiv

Top 0.1%

6.4%

Background and Objective: Access to real-world electronic health records (EHRs) remains limited by privacy, governance and annotation constraints, hindering the development of clinical natural language processing models. Realistic synthetic progress notes may provide EHR-like corpora that preserve clinically rigorous information on diagnoses, treatments, symptoms, imaging, laboratory findings and therapeutic trajectories without relying directly on sensitive patient records. This study evaluates whether large language models (LLMs) can generate realistic Spanish prostate cancer progress notes from published case reports, preserving clinical content, temporality and hospital-style conventions.

5

A Consensus-Driven Stacking Ensemble Framework for Interpretable Cardiovascular Risk Prediction and Clinical Deployment

Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.

2026-05-26 health informatics 10.64898/2026.05.18.26352989 medRxiv

Top 0.2%

2.6%

Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges such as inconsistent and limited datasets, limited infrastructure, and global inequalities create a need for a reliable and practical ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying CVD status. Several data preprocessing techniques are applied, including multiple imputation by chained equations (MICE) and outlier removal. In addition, hyperparameter tuning is performed with GridSearchCV. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed to demonstrate clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.

6

Does Recording Hardware Matter for Clinical Speech Recognition? Evaluating ASR Performance Across Consumer Devices

Tran, B. D.; Hu, D.; Kim, S.; Guo, Y.; Mangu, R.; Reynolds, T. L.; Lafata, J. E.; Tai-Seale, M.; Zheng, K.

2026-05-22 health informatics 10.64898/2026.05.19.26353590 medRxiv

Top 0.2%

2.4%

Ambient clinical intelligence (ACI) systems use automatic speech recognition (ASR) to capture patient-provider conversations for downstream clinical documentation. However, many ASR evaluations are conducted under controlled conditions using specialized hardware. We evaluated how recording devices influence transcription performance of contemporary ASR engines applied to clinical dialogue. Thirty-five primary care encounters were re-enacted from transcribed conversations and recorded using five devices simultaneously: smartphone, laptop microphone, portable recorder, clip-on microphone, and a desktop microphone. Six ASR engines were evaluated using word error rate (WER), clinical concept extraction precision and recall, and sentence-level semantic similarity. Median WER ranged from 16.7% to 20.7% across engines. Engine choice produced larger variation in transcription performance than recording device, although device-related differences were statistically significant. Overall, contemporary ASR engines demonstrated relative robustness to consumer-grade recording hardware, suggesting that model selection may have greater impact on transcription performance than recording device configuration in real-world ACI deployments.
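Word error rate, the headline metric in the abstract above, is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below assumes plain whitespace tokenization with no text normalization; the study's own preprocessing is not described in the abstract, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution plus one insertion over 5 reference words.
example_wer = word_error_rate(
    "the patient reports chest pain",
    "the patient report chest pain today",
)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.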

7

Privacy-Preserving Large Language Model Deployment for Oncology Registry Abstraction: Structure-Aware Evaluation in a Real-World Clinical Setting

Enikeev, R.; Moldovan, M.; Chu, M.; Amalraj, A.; Koli, P. P.; Abdul, S. S.; Sivaraj, H.; Iqbal, U.; Toh, C. K.

2026-05-21 health informatics 10.64898/2026.05.18.26353541 medRxiv

Top 0.2%

2.4%

Background: Structuring oncology clinical notes into registry-grade variables is essential for research and care but remains labour-intensive and error-prone. Objective: To develop and evaluate a privacy-preserving large language model pipeline for oncology registry abstraction in a real-world clinical setting. Methods: We deployed an open-source Meta Llama 3.3 70B-based pipeline to extract over 50 variables from 6,700 oncology notes at a cancer centre in Singapore. Data were de-identified locally using a Hide-In-Plain-Sight approach, ensuring no identifiable data left hospital infrastructure. Performance was assessed on 200 randomly sampled notes with adjudicated ground truth. A structure-aware framework classified outputs as correct, missing, spurious, or incorrect. Results: F1 scores were high across variables, including diagnosis (97.2%), histology (95.8%), stage (92.6%), biomarkers (91.4%), and treatments (88.1%). Transferability testing on 50 external notes showed strong performance for core variables. Conclusions: Privacy-preserving LLMs can achieve near-human-level accuracy for oncology abstraction, with structure-aware evaluation enabling more clinically meaningful assessment. Keywords: Oncology Registry Abstraction, Privacy-Preserving Deployment, Clinical Information Extraction, Structure-Aware Evaluation, Large Language Models, Template-Filling Metrics
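The four-way output classification described above (correct, missing, spurious, incorrect) can be sketched as follows. How the authors derive F1 from these categories is not stated in the abstract; this sketch assumes a common template-filling convention in which incorrect and spurious values count against precision, and incorrect and missing values count against recall. Field names and values are invented.

```python
from collections import Counter
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def classify_fields(gold: Record, pred: Record) -> Counter:
    """Label each field as correct, missing, spurious, or incorrect."""
    counts: Counter = Counter()
    for field in set(gold) | set(pred):
        g, p = gold.get(field), pred.get(field)
        if g is None and p is None:
            continue                      # nothing expected, nothing extracted
        if p is None:
            counts["missing"] += 1        # expected value was not extracted
        elif g is None:
            counts["spurious"] += 1       # extracted value has no gold counterpart
        elif g == p:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1      # both present but different
    return counts


def f1_from_counts(counts: Counter) -> float:
    """F1 under the assumed template-filling convention."""
    tp = counts["correct"]
    predicted = tp + counts["incorrect"] + counts["spurious"]
    expected = tp + counts["incorrect"] + counts["missing"]
    precision = tp / predicted if predicted else 0.0
    recall = tp / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example record with one field in each category.
gold = {"diagnosis": "breast carcinoma", "stage": "II", "histology": "ductal", "biomarker": None}
pred = {"diagnosis": "breast carcinoma", "stage": "III", "histology": None, "biomarker": "HER2+"}
counts = classify_fields(gold, pred)
score = f1_from_counts(counts)
```

Separating missing from spurious errors shows whether a pipeline tends to under-extract or over-extract, which a single accuracy figure hides.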

8

DentaCoPilot: An LLM-Augmented Next-Procedure Recommender for General Dentistry, Designed for Dentist Augmentation

Rodrigues, C. C.; Rebello, S. D.

2026-05-08 dentistry and oral medicine 10.64898/2026.05.07.26352635 medRxiv

Top 0.2%

2.1%

Background: Commercial dental artificial intelligence in 2026 is overwhelmingly diagnostic: caries, calculus, periapical, and bone-level detection on radiographs. The clinically harder question that follows every diagnosis -- given a patient's chart and most recent procedure, what should the dentist do next -- remains unsolved at general-dentistry scale. The closest published system, MultiTP (Chen et al., 2024), is a CNN-RNN restricted to partial-edentulism cases and provides neither calibrated uncertainty, structured rationale, nor an evaluation that treats the model as decision support rather than as an autonomous classifier. Methods: We introduce DentaCoPilot, a recommender that, given a structured chart, returns (i) a calibrated top-K probability distribution over Current Dental Terminology (CDT) codes for the next procedure, (ii) a verbalised confidence label, (iii) an explicit abstain flag when context is insufficient, and (iv) a chart-grounded rationale. We compare four classical baselines (frequency bigram, TF-IDF + logistic regression, XGBoost, MultiTP-style CNN-RNN) and six large-language-model (LLM) variants (Claude Haiku, Sonnet + chain-of-thought, Sonnet + retrieval, Opus + chain-of-thought, Sonnet + classical prior, Opus + classical prior) on a synthetic chart corpus of 500 patients (1,284 test examples). All LLM inference is routed through the local Anthropic Claude Code CLI; every call is logged for full audit. Results: On apples-to-apples evaluation, classical baselines reach 0.567 top-1 / 0.967 top-5; pure LLM variants trail at 0.267-0.467 top-1. Prompt-conditioning a Sonnet LLM on the classical baseline's top-10 candidates (M5) closes the gap: top-5 rises from 0.733 (pure Sonnet + chain-of-thought) to 0.933, matching classical baselines, while preserving rationale and abstention. Increasing the LLM backbone from Sonnet to Opus does not improve accuracy with or without priming.
Calibration via temperature scaling and coverage-risk analysis is reported for the baselines. Conclusion: Prompt-conditioning a small LLM on a classical baseline's top-K is the most cost-effective LLM design we tested for next-procedure recommendation, and the design preserves the augmentation features that distinguish the system from an autonomous classifier. A pre-registered clinician-in-the-loop evaluation at the KLE Vishwanath Katti Institute of Dental Sciences (Belgaum, India) and a real-data evaluation on the multi-institutional BigMouth dental data repository are the next stage of work.
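The top-1 and top-5 figures quoted above are top-K accuracies: the share of test examples whose true next-procedure code appears among the model's K highest-ranked candidates. A minimal sketch follows, with invented rankings; the code strings are illustrative placeholders, not output from DentaCoPilot.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   true_codes: Sequence[str],
                   k: int) -> float:
    """Fraction of examples whose true code is within the first k candidates."""
    if len(ranked_predictions) != len(true_codes):
        raise ValueError("need exactly one ranking per true code")
    if not true_codes:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for ranking, truth in zip(ranked_predictions, true_codes)
        if truth in ranking[:k]
    )
    return hits / len(true_codes)


# Invented rankings (best candidate first) for three test examples.
rankings: List[List[str]] = [
    ["D1110", "D0120", "D2391"],
    ["D0120", "D2391", "D1110"],
    ["D2391", "D1110", "D0120"],
]
truths = ["D1110", "D2391", "D0120"]

top1 = top_k_accuracy(rankings, truths, k=1)
top2 = top_k_accuracy(rankings, truths, k=2)
top3 = top_k_accuracy(rankings, truths, k=3)
```

Top-K accuracy can only stay level or rise as K grows, which is why a system can trail clearly at top-1 yet match a baseline at top-5.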

9

Hierarchical integration of multimodal clinical data to predict epilepsy surgery outcome

Thomas, J.; Abdallah, C.; Aung, T.; Bosque-Varela, P.; Dolezalova, I.; Parikh, P.; Wadi, L.; Jaber, K.; Kai, Z.; Ho, A.; Moye, M. K.; Minato, E.; Aron, O.; Chabardes, S.; Colnat-Coulbois, S.; Hall, J.; Klimes, P.; Minotti, L.; Dubeau, F.; Southwell, D.; Carlson, D.; Brazdil, M.; Gonzalez-Martinez, J.; Kahane, P.; Maillard, L.; Gotman, J.; Frauscher, B.

2026-05-06 neurology 10.64898/2026.05.05.26352481 medRxiv

Top 0.2%

2.1%

Background: Integrating multimodal data into medical artificial intelligence (AI) tools and evaluating whether they outperform human experts remains a critical challenge. Epilepsy surgery offers a unique paradigm for this evaluation, as it provides an expert-independent measure (Engel score) of post-surgical outcome. Currently, evaluation for epilepsy surgery relies on the visual interpretation and human synthesis of multimodal data. While clinical evaluations are individualized and account for complex anatomical variability, integrating these diverse, high-dimensional modalities to generate a probability of surgical success remains challenging. Here, we leverage this objective outcome score to investigate the feasibility of a data-driven, phenotype-based model against the current clinical gold standard. Methods: The evaluation was performed on an epilepsy-type controlled cohort of 57 patients from six tertiary epilepsy surgery centers who underwent resective/ablative surgery in the mesiotemporal lobe. Multimodal data, namely, patient demographics, semiology, invasive electrophysiology monitoring, and neuroimaging, were utilized. We first estimated how human experts perceive surgery success. Subsequently, we developed a data-driven model integrating these modalities to predict surgery outcomes. The model performance was compared to the current clinical gold standard (three independent human experts) and published outcome calculators. Finally, modality-level phenotypes were derived based on the model's predictions. Results: Predictions by human experts correlated poorly with post-surgical outcomes, and published outcome calculators did not perform better than the experts (DeLong's p = 0.367). Our model incorporating multimodal data achieved an area under the receiver operating characteristic curve (AUROC) of 0.801.
It performed statistically significantly better than the best human expert (DeLong's p = 0.043) and achieved a higher AUROC than the best published surgical outcome calculator (0.801 vs. 0.694). Conclusions: We demonstrated, as a proof of concept, that data-driven multimodal phenotypes can inform personalized surgery planning in epilepsy. Furthermore, we provide a framework for integrating multimodal data and benchmarking medical AI performance against human experts.
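The AUROC values compared above have a direct interpretation: the probability that a randomly chosen patient with a good outcome receives a higher predicted score than a randomly chosen patient with a poor outcome. The sketch below computes it by pairwise comparison, which is practical for small cohorts. Labels and scores are invented, and DeLong's test for comparing two AUROCs is not implemented here.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = favourable outcome, 0 = unfavourable outcome.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
example_auroc = auroc(labels, scores)   # 5 of 6 pairs are ranked correctly
```

With a cohort of 57 patients the number of such pairs is modest, so AUROC estimates carry wide uncertainty; that is why the abstract reports a significance test alongside the point estimates.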

10

UPhAIR: A Hybrid Pipeline for Generating Understandable Post-hoc AI Reports in Glioma IDH Mutation Status Prediction

Gorji, A.; Shahverdi, H.; Saberi, A.; Gheiji, B.; Farahani, S.; Azemi, G.; Di Ieva, A.

2026-05-08 health informatics 10.64898/2026.05.01.26349658 medRxiv

Top 0.2%

2.1%

Clinical adoption of machine learning (ML) in medical imaging is limited by the lack of interpretability. To address this, we present understandable post-hoc artificial intelligence reports (UPhAIR), a pipeline designed to generate transparent, evidence-based explanations by combining Shapley additive explanation (SHAP) analysis with retrieval-augmented generation (RAG) and large language models (LLMs). We trained 12 classifiers to predict isocitrate dehydrogenase (IDH) mutation status in glioma using radiomics and clinical features. SHAP values were used to identify key contributors to each prediction. Domain literature was collected from three sources and indexed within a RAG framework. Relevant papers were retrieved using Facebook AI Similarity Search (FAISS) vector search and provided to Google Gemini 2.5 Pro to generate concise, reference-supported explanations for each feature. The model achieved a best AUC of 0.90 ± 0.02 on 5-fold cross-validation using an extreme gradient boosting (XGBoost) classifier and a hold-out test AUC of 0.86. In a case study of a single patient excluded from training, the model correctly predicted the patient to have IDH-wildtype glioma, and SHAP identified MGMT status, age, and three radiomic features as the most influential features. UPhAIR produced a structured report combining SHAP visualizations with LLM-generated summaries grounded in scientific evidence. UPhAIR provides a practical, model-agnostic framework that enhances ML interpretability in clinical settings, helping bridge the gap between black-box AI and real-world medical decision-making.
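SHAP attributions approximate Shapley values, which credit each feature with its average marginal contribution over all feature coalitions. The sketch below computes exact Shapley values by brute-force enumeration for a toy value function. It is not the SHAP library and not the authors' model: the feature names echo the abstract, but the contribution numbers are invented, and enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(value_fn: Callable[[FrozenSet[str]], float],
                   features: Sequence[str]) -> Dict[str, float]:
    """Exact Shapley value of each feature for a coalition value function."""
    n = len(features)
    result: Dict[str, float] = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = frozenset(subset)
                with_feature = without | {feature}
                total += weight * (value_fn(with_feature) - value_fn(without))
        result[feature] = total
    return result


def toy_value(coalition: FrozenSet[str]) -> float:
    """Invented model output when only the given features are 'known'."""
    value = 0.0
    if "mgmt_status" in coalition:
        value += 0.3
    if "age" in coalition:
        value += 0.2
    if {"mgmt_status", "age"} <= coalition:
        value += 0.1          # interaction term, split equally by Shapley
    return value


features = ["mgmt_status", "age", "radiomic_1"]
attributions = shapley_values(toy_value, features)
```

The attributions sum to the difference between the full-coalition value and the empty-coalition value, the additivity property that lets a per-feature breakdown be read as a decomposition of one prediction.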

11

A Retrospective Evaluation of the Microsoft Healthcare Agent Orchestrator for Tumor Board Patient Summaries

Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.

2026-06-01 health informatics 10.64898/2026.05.22.26353812 medRxiv

Top 0.3%

1.7%

Background: Preparing tumor board patient summaries is time intensive. Large-language-model based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. 
These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.

12

Early-Horizon Multimodal ICU Mortality Prediction Without Retraining

Bakumenko, A.; Smith, D. H.; Hoelscher, J.

2026-05-21 health informatics 10.64898/2026.05.18.26353392 medRxiv

Top 0.3%

1.7%

Earlier ICU mortality prediction is more clinically useful because it can identify high-risk patients while treatment decisions can still change. Yet most models are trained on data from a fixed time window, so it is unclear whether a model trained on the first 48 hours of ICU data remains reliable when used earlier in the ICU stay. We evaluated a multimodal ICU mortality model trained once at 48 hours and then applied unchanged at 6, 12, 24, and 48 hours on MIMIC-III. The model combines an LSTM for physiological time-series data, a finetuned ClinicalModernBERT model for clinical notes, and a logistic regression fusion layer. Performance remained strong at earlier time points, suggesting that useful mortality prediction is possible earlier in the ICU stay even without retraining. At 6 hours, the model achieved AUROC 0.777 and remained well-calibrated (ECE 0.038) without any recalibration, and it outperformed both single-modality models at every horizon. The multimodal benefit was most evident at earlier horizons, when physiological data were sparse: agreement between the two specialists dropped by more than half from 48 to 6 hours, while the median contribution from clinical notes increased from 37% to 49%. A Bayesian version of the fusion layer showed that uncertainty decreased for survivors as more data accumulated but remained high for non-survivors; the most uncertain cases were up to 4.9 times more likely to be non-surviving patients. Continuous hourly analyses further showed that clinical notes provide stable context between documentation events. Simply carrying forward the most recent note matched or outperformed note-decay and documentation-gap alternatives. These results suggest that a multimodal ICU mortality model trained on 48 hours of data can provide trustworthy earlier predictions without retraining, while also identifying the cases that remain hardest to interpret.
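The calibration figure quoted above (ECE 0.038) is an expected calibration error. One common form for a binary outcome groups predictions into equal-width probability bins and averages the gap between mean predicted risk and observed event rate, weighted by bin size. The sketch below assumes 10 equal-width bins; the abstract does not state the binning scheme, and the example probabilities are invented.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean |predicted risk - observed event rate|."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_predicted - observed_rate)
    return ece


# Invented example: predictions are off by 0.2 in both occupied bins.
overconfident_gap = expected_calibration_error([0.2, 0.2, 0.8, 0.8], [0, 0, 1, 1])
# Predicted 50% risk with a 50% observed event rate: perfectly calibrated.
calibrated_gap = expected_calibration_error([0.5, 0.5], [0, 1])
```

A low ECE says predicted risks match observed frequencies on average; it says nothing about discrimination, which is why the abstract reports AUROC and ECE side by side.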

13

From Power Spectral Density to Wavelets: Improving Symbolic Representations of Electroencephalography Band Dynamics in the Weed Plot Framework

Meinardi, V.; Boyallian, C.; Giuzio, R.

2026-05-06 neurology 10.64898/2026.05.05.26352441 medRxiv

Top 0.3%

1.7%

Electroencephalography (EEG) interpretation in clinical practice relies on the analysis of energy distribution across standard frequency bands. The Weed Plot framework encodes band-wise spectral energy, computed using Fourier-based methods, into a symbolic representation that preserves the interpretability of traditional EEG analysis. In this study, we propose a wavelet-based extension of this framework, where the energy of predefined clinical EEG bands is estimated using the Discrete Wavelet Transform instead of Power Spectral Density. Unlike Fourier-based approaches, wavelets provide a time-frequency representation that captures transient and non-stationary dynamics while remaining consistent with clinically defined bands. From these estimates, symbolic patterns are constructed based on the relative ordering of frequency bands within short temporal windows. Their empirical distribution is used to extract entropy-based features for epilepsy detection using multiple machine learning classifiers. From an Artificial Intelligence perspective, the main contribution is a structured symbolic encoding that enhances feature discriminability. From an engineering perspective, the contribution lies in an automated framework for EEG-based epilepsy detection. Experimental results show that wavelet-based representations improve classification performance compared to raw entropy and Fourier-based features. This improvement arises from the interaction between time-frequency localization and symbolic encoding, producing more discriminative feature distributions. These findings support wavelet-based symbolic representations as a robust and interpretable framework for EEG analysis, bridging clinical interpretation and data-driven methods.
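The symbolic step described above, ranking band energies within each short window and then summarising the distribution of rankings with entropy, can be sketched as follows. This is a loose illustration rather than the Weed Plot framework itself: the four-band set, the Shannon entropy feature, and the energy values are all assumptions, and the wavelet or spectral estimation of band energy is not shown.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

BANDS = ("delta", "theta", "alpha", "beta")   # assumed band set


def ordering_symbol(energies: Dict[str, float]) -> Tuple[str, ...]:
    """Symbol for one window: bands sorted from highest to lowest energy.

    Ties keep the order in BANDS because Python's sort is stable.
    """
    return tuple(sorted(BANDS, key=lambda band: energies[band], reverse=True))


def symbol_entropy(windows: Sequence[Dict[str, float]]) -> float:
    """Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(ordering_symbol(w) for w in windows)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Invented band energies for four consecutive windows.
delta_dominant = {"delta": 5.0, "theta": 3.0, "alpha": 2.0, "beta": 1.0}
alpha_dominant = {"delta": 2.0, "theta": 1.0, "alpha": 6.0, "beta": 3.0}

alternating = [delta_dominant, alpha_dominant, delta_dominant, alpha_dominant]
steady = [delta_dominant] * 4

alternating_entropy = symbol_entropy(alternating)   # two equally likely symbols
steady_entropy = symbol_entropy(steady)             # a single symbol
```

A recording that keeps the same band ordering yields zero entropy, while one that switches between orderings yields higher entropy; features of this kind are what the classifiers in the abstract consume.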

14

Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson's Disease

Shechter, Y.; Klevor, R.; Kouchache, T.; Bouhadoun, S.; Postuma, R. B.

2026-05-20 neurology 10.64898/2026.05.13.26353021 medRxiv

Top 0.4%

1.5%

Background: The clinical applicability of large language models (LLMs) in Parkinson's disease (PD) management remains insufficiently characterized, particularly in generative responses to clinical vignette scenarios. Objective: To evaluate the quality of clinical assessments and management plans generated by a general-purpose LLM (Gemini 1.5 Pro) and a medically specialized LLM (OpenEvidence), and to compare their performance. Methods: Models generated free-text responses to 45 open clinical queries, each covering an assessment of the situation and a recommended management plan. Two movement disorders fellows rated outputs using 5-point Likert scales, dichotomized into clinically appropriate (≥4) versus inappropriate (≤3). Discrepancies were adjudicated by a senior movement disorders specialist. Paired comparisons used McNemar's test; qualitative analysis examined severe errors. Results: Gemini 1.5 Pro and OpenEvidence showed high rates of clinically appropriate assessments (80.0% vs. 86.7%) but lower performance in management plans (48.9% vs. 57.8%). Both assessment and plan were clinically appropriate in 46.7% and 55.6% of cases, respectively. None of these differences reached statistical significance. Severe errors were uncommon in assessments (6.7% vs. 8.9%) but more frequent in plans (26.7% in both), predominantly reflecting treatment strategy errors. Conclusions: In generative clinical reasoning tasks involving Parkinson's disease management vignettes, LLMs demonstrated reasonable performance in assessment, but consistent limitations in plan generation. The medically specialized LLM demonstrated several qualitative advantages but no statistically significant performance benefit over the general-purpose model. Therefore, these tools should be used with appropriate caution in Parkinson's disease management, particularly regarding treatment recommendations.
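McNemar's test, used above for the paired comparison, looks only at discordant vignettes: those where one model was rated appropriate and the other was not. A minimal sketch of the exact (binomial) form follows. The discordant counts are hypothetical because the abstract reports only marginal percentages, and it does not say whether the exact or the chi-square version was used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases where only model A was appropriate
    c: cases where only model B was appropriate
    Under the null hypothesis each discordant case is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0   # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# Hypothetical discordant counts among 45 paired vignettes.
only_general_purpose_ok = 3
only_specialized_ok = 7
p_value = mcnemar_exact(only_general_purpose_ok, only_specialized_ok)
```

With so few discordant pairs, even a 7-to-3 split is compatible with chance, which fits the abstract's finding that none of the differences reached statistical significance.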

15

MASHA: A Multi-Agent System for Healthcare Sentiment Analysis Using AI for Migraine Detection in Arabic Tweets

Baroud, S.

2026-05-22 health informatics 10.64898/2026.05.21.26352626 medRxiv

Top 0.4%

1.5%

Migraine detection and sentiment analysis in healthcare have become increasingly important, particularly with the rise of social media platforms like Twitter, where users often share their personal health experiences. This study presents MASHA (Multi-Agent System for Healthcare Sentiment Analysis), an artificial intelligence (AI)-driven framework that integrates multiple machine learning (ML) models for sentiment analysis of Arabic tweets related to migraines. The system leverages a multi-agent architecture to handle tasks such as data acquisition, pre-processing, model training and real-time decision-making. Key ML models, including Support Vector Machines (SVM), Naive Bayes (NB) and Logistic Regression (LR), are integrated using ensemble techniques, leading to improved classification performance. Experiments conducted on a dataset of Arabic tweets demonstrate that MASHA outperforms traditional methods, achieving an accuracy of 90.0% and an F1-score of 89.46%. Moreover, the system's scalability and flexibility make it suitable for real-time public health monitoring, offering valuable insights into patient experiences and public sentiment regarding healthcare services. MASHA's adaptability suggests its potential application for analysing other healthcare-related conditions, reinforcing the system's scalability and broader relevance. Future work will focus on incorporating deep learning (DL) models and expanding the dataset with content from additional social media platforms.

16

Impact of AI-Assisted Mammography Reading on Quality Indicators in the Czech Breast Cancer Screening Programme: A Retrospective Study

Veverkova, L.; Dolezalova, Z.; Marackova, V.; Mathew, E.; Urbankova, M.; Ambrozova, M.; Piskovsky, T.; Ngo, O.; Majek, O.

2026-05-26 oncology 10.64898/2026.05.25.26353869 medRxiv

Top 0.4%

1.3%

Objectives: The aim of mammographic screening is the early detection of invasive cancers. Artificial intelligence (AI) may improve diagnosis at earlier stages. The purpose of this study was to retrospectively assess the impact of AI-assisted reading on selected quality indicators. Method: The data source was the Breast Cancer Screening Registry using data from one Screening Unit that currently uses AI routinely. The indicators of the cancer detection rate (CDR), further assessment rate (FAR), and recall rate (RR) in the year 2023, when AI was used, and the year 2022, without AI, in women aged 45-69 were compared. The statistical evaluation used the chi-square test and logistic regression adjusting for the effects of age, a woman's risk level, and the screening round at a 5% significance level. Results: In 2022, without AI, 4,034 women aged 45-69 were included, compared with 4,049 women in 2023 when AI was used. This study showed a non-significant increase in CDR from 5.0 breast cancers detected per 1,000 women (non-AI assessment) to 5.2 (AI-assisted assessment), p = 0.919; OR (95% CI): 1.034 (0.542-1.974), a significant decrease in the FAR from 5.2% to 3.9%, p < 0.001; OR (95% CI): 0.665 (0.529-0.836), and a decrease in RR from 2.4% to 1.9%, p = 0.083; OR (95% CI): 0.754 (0.548-1.037). Conclusion: AI has the potential to be a useful tool in the early detection of breast cancer by improving quality through a decrease in FAR and RR, while probably maintaining CDR.
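
To make the reported odds ratios easier to read, the sketch below computes a crude (unadjusted) odds ratio directly from the rounded rates quoted in the abstract. This is an illustration only: the published ORs come from a logistic regression adjusted for age, risk level and screening round, so the crude values are expected to differ from them. The helper names `odds` and `crude_odds_ratio` are hypothetical.

```python
def odds(p):
    """Convert a proportion p (0 < p < 1) into odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def crude_odds_ratio(p_reference, p_comparison):
    """Unadjusted odds ratio of the comparison period versus the reference."""
    return odds(p_comparison) / odds(p_reference)


# Rounded rates from the abstract: 2022 (no AI) is the reference, 2023 (AI) the comparison
far_or = crude_odds_ratio(0.052, 0.039)    # further assessment rate
rr_or = crude_odds_ratio(0.024, 0.019)     # recall rate
cdr_or = crude_odds_ratio(0.0050, 0.0052)  # cancer detection rate (per woman)

for name, value in [("FAR", far_or), ("RR", rr_or), ("CDR", cdr_or)]:
    print(f"{name}: crude OR = {value:.3f}")
```

A value below 1 means the indicator was less likely in the AI-assisted year. The gap between these crude figures and the adjusted ORs in the abstract reflects both the covariate adjustment and the rounding of the published rates.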

17

Multi-Agent AI for Chest Radiography: A Sequential Segmentation and LLM-Driven Consultative Tool for Medical Training

Kurt, F.; Subasi, A.

2026-06-01 health informatics 10.64898/2026.05.29.26354432 medRxiv

Top 0.5%

0.9%

Background: Traditional diagnostic models lack explainability, while multimodal language models prone to hallucination remain unsafe for medical education. An interactive, risk-free artificial intelligence framework is required to serve as a reliable clinical mentor for radiology trainees. Methods: We propose a multi-agent architecture decoupling deterministic image analysis from generative consultation. Specialized computer vision models perform anatomical localization and pathological segmentation. These quantitative outputs are synthesized into a structured payload, which grounds a locally hosted large language model (LLaVA 7B) using strict prompt guardrails and prerequisite protocols. Results: The system effectively eliminates visual hallucinations by intercepting unanchored queries. The artificial intelligence tutor successfully contextualizes spatial anomalies and baseline metrics, generating accurate conversational explanations and formally structured radiology reports while strictly enforcing medical safety disclaimers. Discussion and Conclusion: By anchoring language generation exclusively to verified algorithmic realities, this framework transforms opaque diagnostic models into safe, interactive educational simulators. This establishes a highly reliable paradigm for integrating explainable artificial intelligence into medical training.

18

Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.

2026-06-01 health informatics 10.64898/2026.05.28.26354362 medRxiv

Top 0.5%

0.9%

Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-Drug relationships in inpatient and ambulatory clinical notes. Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. Evaluated models included GPT-4o, GPT-4o-mini, LLAMA 3.3-70B and their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed using both strict and relaxed evaluations (precision, recall, and F1) for all models, followed by manual evaluation (exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong) of the two best-performing models. Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings. Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.
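
The abstract reports precision, recall and F1 under "strict" and "relaxed" evaluation without defining the two. The sketch below shows one plausible reading, stated here as an assumption: strict requires the drug and the ADE text to match exactly after normalisation, while relaxed requires the same drug and at least one shared ADE token. All function names and example pairs are hypothetical.

```python
def _norm(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def strict_match(pred, gold):
    """Assumed strict rule: drug and ADE strings are identical after normalisation."""
    return _norm(pred[0]) == _norm(gold[0]) and _norm(pred[1]) == _norm(gold[1])


def relaxed_match(pred, gold):
    """Assumed relaxed rule: same drug and at least one shared ADE token."""
    same_drug = _norm(pred[0]) == _norm(gold[0])
    shared = set(_norm(pred[1]).split()) & set(_norm(gold[1]).split())
    return same_drug and bool(shared)


def evaluate(preds, golds, match):
    """Return (precision, recall, F1); each gold pair can be matched only once."""
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if match(pred, gold):
                unmatched.remove(gold)  # safe: we leave the inner loop right away
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative (drug, ADE) pairs, not taken from the study data
golds = [("lisinopril", "dry cough"), ("warfarin", "nosebleed")]
preds = [("Lisinopril", "cough"), ("warfarin", "nosebleed"), ("aspirin", "rash")]
strict_scores = evaluate(preds, golds, strict_match)
relaxed_scores = evaluate(preds, golds, relaxed_match)
print("strict  P/R/F1:", strict_scores)
print("relaxed P/R/F1:", relaxed_scores)
```

In this toy case the partial mention "cough" counts only under the relaxed rule, which is why relaxed scores are typically at least as high as strict ones.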

19

Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

Dworkis, D. A.; Stenstrom, J.; Sen, A.; Lucarelli, R. T.

2026-05-25 emergency medicine 10.64898/2026.05.22.26353904 medRxiv

Top 0.5%

0.9%

Background: Stroke is a time-sensitive neurological emergency in which early EMS activation and presentation to definitive care are cornerstones of effective therapy. Large language models (LLMs) are increasingly consulted by the public for medical advice, but the veracity of the guidance provided by commercially available models responding to potential stroke symptoms is not well understood. Methods: We performed a cross-model benchmarking study comparing the triage choices of three frontier LLMs (Claude Sonnet 4.6, GPT-4o, and Llama 3.3-70b-versatile) on first-person vignettes describing a unilateral arm symptom on waking, across 10 symptom descriptors, and two clinical phases (before and after a partially reassuring self-examination), with or without a clinical distractor (n=50 per condition). Results: Claude sought emergency care most often, Llama least, and GPT-4o in between, diverging most sharply in the post-examination phase where Claude called 911 in 100% of runs, Llama called for non-emergency help in 100%, and GPT-4o was symptom-dependent. A distractor shifted behavior away from emergency care in almost all conditions: calling 911 fell from 37.9% to 14.6% and waiting rose from 0% to 45.9% in the post-examination vignette. Responses were also sensitive to symptom word: weak, limp, heavy, and clumsy generated higher alarm, whereas numb, tingly, odd, strange, and weird generated less urgent responses. Conclusions: The increasing use of LLMs for medical advice has significant public health implications. Commercially available LLMs show significant model-to-model variability and framing sensitivity when confronted with potential stroke symptoms, including under-recognition of canonical CDC warning descriptors, underscoring the need for systematic benchmarking as these tools become de facto first points of contact for patients experiencing neurological emergencies.

20

Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline

Alsaiari, A.; Turki, T.; Taguchi, Y.-h.

2026-05-04 bioinformatics 10.64898/2026.04.29.721782 medRxiv

Top 0.5%

0.9%

Ovarian cancer is one of the gynecological cancer types, which, if metastasized and not detected early, can cause deaths among women. Therefore, there is a need to accurately predict drug responses in ovarian cancer. A gynecological pathologist inspects abnormality in tissues, followed by providing a report about patients; however, such a diagnostic process (1) is hard; (2) requires experience; and (3) is time-consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve drug response prediction in ovarian cancer, as follows. First, we downloaded digital pathology images pertaining to ovarian bevacizumab response from The Cancer Imaging Archive repository. We applied the histogram of oriented gradients to the images to construct feature vectors, which were passed to Fisher linear discriminant analysis for dimensionality reduction. Then, we fed the reduced-dimensionality data to support vector regression with various kernels and calculated the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with radial kernel (named SVRD+R) yielded an AUC improvement of 17% over the best-performing transformer-based model (ViT) and an AUC improvement of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline when tackling prediction problems pertaining to gynecologic cancer studies. MSC: 92B05; 68T09
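
The pipeline scores a regression output with AUC, which works because AUC depends only on how the continuous scores rank responders against non-responders. The sketch below computes AUC by pairwise comparison (equivalent to the Mann-Whitney U statistic). It is not the authors' code; the function name `roc_auc` and the example scores are hypothetical stand-ins for support vector regression outputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = responder, an assumed coding).
    scores: continuous predictions, e.g. outputs of a regression model.
    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Illustrative labels and regression scores, not taken from the study
example_labels = [1, 1, 0, 0, 1]
example_scores = [0.9, 0.4, 0.35, 0.6, 0.8]
example_auc = roc_auc(example_labels, example_scores)
print(f"AUC = {example_auc:.3f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based implementation would be preferable on large cohorts.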